Virtual screening of bioassay data

Author

  • Amanda C. Schierz
Abstract

BACKGROUND There are three main problems associated with the virtual screening of bioassay data. The first is access to freely available curated data; the second is the number of false positives that occur in the physical primary screening process; the third is that the data is highly imbalanced, with a low ratio of Active to Inactive compounds. This paper first discusses these three problems and then applies a selection of Weka cost-sensitive classifiers (Naive Bayes, SVM, C4.5 and Random Forest) to a variety of bioassay datasets.

RESULTS Pharmaceutical bioassay data is not readily available to the academic community. The data held at PubChem is not curated, and there is a lack of detailed cross-referencing between Primary and Confirmatory screening assays. Because of this lack of cross-referencing, the analysis of the false positives that occur in the primary screening process has been shallow. In the six cases found, the average percentage of false positives from the High-Throughput Primary screen is high, at 64%. For the cost-sensitive classification, Weka's implementations of the Support Vector Machine and the C4.5 decision tree learner performed relatively well. It was also found that the setting of the Weka cost matrix depends on the base classifier used and not solely on the ratio of class imbalance.

CONCLUSIONS Understandably, pharmaceutical data is hard to obtain. However, it would benefit both the pharmaceutical industry and academics if curated primary screening data and the corresponding confirmatory data were provided. Two benefits could be gained by applying virtual screening techniques to bioassay data: first, the search space of compounds to be screened could be reduced; second, analysing the false positives that occur in the primary screening process may allow the technology to be improved. The number of false positives arising from primary screening raises the question of whether this type of data should be used for virtual screening. Care is needed when using Weka's cost-sensitive classifiers: across-the-board misclassification costs based on class ratios should not be used when comparing different classifiers on the same dataset.
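To illustrate what a class-ratio cost matrix does, here is a minimal Python sketch of the general minimum-expected-cost decision rule. It is not Weka's implementation and is not taken from the paper: the function names, the `cost_matrix[actual][predicted]` layout, and the compound counts (100 Active, 9900 Inactive) are all assumptions made for illustration.

```python
from typing import List, Sequence

# Class indices used throughout this sketch (an illustrative convention).
ACTIVE, INACTIVE = 0, 1


def ratio_cost_matrix(n_active: int, n_inactive: int) -> List[List[float]]:
    """Build a 2x2 cost matrix from the class ratio.

    Layout is cost_matrix[actual][predicted]. Missing an Active compound
    (false negative) costs n_inactive / n_active; a false positive costs 1.
    """
    fn_cost = n_inactive / n_active
    return [
        [0.0, fn_cost],  # actual Active: predicted Active, predicted Inactive
        [1.0, 0.0],      # actual Inactive: predicted Active, predicted Inactive
    ]


def expected_costs(probs: Sequence[float],
                   cost_matrix: Sequence[Sequence[float]]) -> List[float]:
    """Expected cost of predicting each class, given class probabilities."""
    n = len(probs)
    return [
        sum(probs[actual] * cost_matrix[actual][pred] for actual in range(n))
        for pred in range(n)
    ]


def min_cost_class(probs: Sequence[float],
                   cost_matrix: Sequence[Sequence[float]]) -> int:
    """Return the class index with the lowest expected cost."""
    costs = expected_costs(probs, cost_matrix)
    return min(range(len(costs)), key=costs.__getitem__)


if __name__ == "__main__":
    costs = ratio_cost_matrix(n_active=100, n_inactive=9900)
    # A compound that the base classifier scores as 5% likely to be Active.
    probs = [0.05, 0.95]
    label = min_cost_class(probs, costs)
    print("Active" if label == ACTIVE else "Inactive")
```

With this matrix the decision flips to Active once the estimated probability of activity exceeds about 1%, so the outcome depends heavily on how each base classifier produces its probability estimates. That is consistent with the abstract's caution against reusing one ratio-based cost matrix across different classifiers.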

Similar resources

Optimized Preprocessing for Accurate and Efficient Bioassay Prediction with Machine Learning Algorithms

Bioassay is the measurement of the potency of a chemical substance by its effect on a living animal or plant tissue. Bioassay data and chemical structures from pharmacokinetic and drug metabolism screening are mined from and housed in multiple databases. Bioassay prediction is calculated accordingly to determine further advancement. This paper proposes a four-step preprocessing of datasets for ...

Bioassay Screening of the Essential Oil and Various Extracts of Fruits of Heracleum persicum Desf. and Rhizomes of Zingiber officinale Rosc. using Brine Shrimp Cytotoxicity Assay

In the present work, the essential oil and various extracts of two plants, the fruits of Heracleum persicum Desf. and the rhizomes of Zingiber officinale Rosc., were screened with the brine shrimp test. There is only one report on the cytotoxicity of H. sphondylium in the literature, so H. persicum was used as a second selection. At first essential oil and various ex...

GPURFSCREEN: a GPU based virtual screening tool using random forest classifier

BACKGROUND In-silico methods are an integral part of the modern drug discovery paradigm. Virtual screening, an in-silico method, is used to refine data models and reduce the chemical space on which wet lab experiments need to be performed. Virtual screening of a ligand data model requires large-scale computations, making it a highly time-consuming task. This process can be sped up by implementin...

AMDD: Antimicrobial Drug Database

Drug resistance is one of the major concerns for antimicrobial chemotherapy against any particular target. Knowledge of the primary structure of antimicrobial agents and their activities is essential for rational drug design. Thus, we developed a comprehensive database, anti microbial drug database (AMDD), of known synthetic antibacterial and antifungal compounds that were extracted from the av...

Volume 1

Publication date: 2009